Abstract. β-sheet topology prediction is a major unresolved
problem in modern computational biology and a challenging
intermediate step toward protein tertiary structure
prediction. Different methods have been proposed to determine
the β-sheet topology. Here, two ab-initio probability-based
methods, "BetaProbe1" and "BetaProbe2", are used to specify
the β-sheet topology. Both methods take into account the
stability and the frequency of β-strand pairwise interactions
and β-sheet conformations. To favor the more frequent
interactions between β-strand pairs, the score of an
interaction combines the pairwise alignment probability with
the probability that the pairwise interaction occurs.
Furthermore, a dynamic programming approach is used to
determine the β-strand pairwise alignment probability more
accurately. In addition, integer programming optimization is
combined with the probabilities of β-strand pairwise
interactions to determine the β-sheet topology. Moreover, the
β-sheet conformation probability is considered so that more
frequently observed conformations have a better chance of
being selected. Experimental results show that BetaProbe1 and
BetaProbe2 significantly outperform the most recent integer
programming-based method with respect to β-sheet topology
prediction.
1. Introduction


Proteins perform critical functions within living organisms.
Biologists believe that the function of a protein is
determined by its tertiary structure, so it is important to
determine protein structures. However, the conventional
experimental methods for doing so, namely X-ray
crystallography and Nuclear Magnetic Resonance (NMR)
spectroscopy, are very costly, time-consuming, and sometimes
infeasible. Moreover, of the 30 million proteins with known
primary structures in the protein databases [1], the tertiary
structures of only 30 thousand have been determined by
experimental methods [2]. Therefore, there is a huge gap
between the number of known primary structures and the number
of determined tertiary structures.


Manuscript received May 16, 2016; accepted September 24, 
2016. 

Department of Computer Engineering, Engineering Faculty, 
Ferdowsi University of Mashhad, Mashhad, Iran. 

*The corresponding author's e-mail is: 
eghdami.mahdie@mail.um.ac.ir 


Hence, the insufficiency of empirical methods motivates the
use of computational methods for the protein structure
prediction problem.

One of the most frequent elements of protein structure is
the β-sheet, which consists of separate segments known as
β-strands. β-strands are typically six to eight amino acids
long [3]; their residues interact with amino acids of other
β-strands to form paired β-strands (partners). The
interaction between two β-strands can take one of two forms
(parallel or anti-parallel), depending on their orientation
as given by the position of the β-strands' N- and C-termini
[4]. Each amino acid in a β-strand can make at most two
hydrogen bonds with amino acids in the paired strand. The
interactions between the amino acid residues of the paired
β-strands are known as a β-contact map.

β-sheets can be open or closed. Open β-sheets have two edge
strands and are the most common type of β-sheet. Fig. 1 shows
an example of an open β-sheet in which four β-strands
interact. In closed β-sheets, on the other hand, a ring is
formed by hydrogen bonding between the first strand and the
last one.


Fig. 1. Open β-sheet of a protein with PDB (Protein Data Bank) id
1NZOD. β-strands that form the β-sheet are numbered in
sequential order.


β-sheet topology prediction is regarded as one of the most
important unresolved problems on the way to tertiary
structure prediction of proteins [5]. Correct prediction of
β-sheet topology remains challenging because hydrogen bonds
form between β-sheet residues that are far apart in the
sequence [4]. Furthermore, the global covariations and
constraints characteristic of β-sheet structures have not
been well exploited [4]. β-sheet topology prediction provides
valuable information for predicting protein
three-dimensional structure [6], [7], designing new proteins
and new drugs [8], [9], and determining folding pathways
[10], [11].
The main goal of predicting β-sheet topology from the
protein's amino acids is to determine the organization of
β-strands in the β-sheets. This includes identifying the
β-strand members of each β-sheet and describing the β-sheets
by specifying the paired β-strands and their interaction
types. Further, β-contact maps are determined in β-sheet
structure prediction. Different methods have been proposed to
address the problem of predicting β-sheet topology; they are
described in the next section.

In this article, we present BetaProbe1 [12] and BetaProbe2,
ab-initio probability-based methods for β-sheet topology
prediction. The main advantage of the proposed methods over
previous work is that they exploit the fact that more
frequent and more stable conformations should have greater
chances of being selected. For this purpose, the score of an
interaction between two β-strands is computed from both the
pairwise alignment probability and the pairwise interaction
probability. Moreover, to make the alignments more accurate,
the optimum pairwise alignment of the β-strands is found
using a dynamic programming approach. Furthermore, combining
integer optimization with the β-strand pairwise interaction
probability improves the accuracy of the predicted
interactions. In addition, using the β-sheet conformation
probability in the last step of BetaProbe1 leads to
predicting more frequent and more stable conformations.

In the rest of this paper, related studies are reviewed in
Section 2. The details of the proposed methods are described
in Section 3. Finally, the performance of the proposed
methods is compared with the most recent integer
programming-based β-sheet prediction method in Section 4.


2. Related Work 


Most β-sheet topology prediction methods utilize contact
maps and strand alignment. Any improvement in the accuracy of
these two tasks leads to a higher accuracy in determining the
architecture of β-sheets. In this section, related work on
these tasks is introduced first. Then, some β-sheet
prediction methods are explained.

Specifying the protein contact map is the first step in
determining its final structure. A contact map is usually
expressed as a two-dimensional matrix. For two amino acids
r_i and r_j, the entry in the i-th row and the j-th column,
ContactMap(i, j), lies between 0 and 1; the closer it is to
one, the more likely the two residues are to interact with
each other in the final structure of the protein. NNcon [13],
DNcon [14], SVMcon [15] and Distill [16] can be mentioned as
contact map prediction methods. CMAPpro [17], PSICOV [18] and
PhyCMAP [19] are the most recent methods which include
contact map prediction.

Methods with high accuracy and acceptable execution time
have been proposed for the sequence alignment problem.
Pairwise sequence alignment is the most common technique used
in β-sheet prediction methods, and the usual approach to
determining the best alignment between two strands is dynamic
programming.


Many efforts have been made to address the problem of
predicting β-sheet topology. These works can be divided into
two major categories: homology-based methods and ab-initio
methods. Homology-based methods such as SMURF [20],
SMURFLite [21], and MRFy [22] use homology information about
proteins to recognize their topologies. Ab-initio methods, on
the other hand, consider only the amino acids' pairing
potentials and statistical information. In this article, we
concentrate on the ab-initio β-sheet topology prediction
methods. They utilize different approaches such as
statistical potentials [23], information theory [24],
Bayesian models and exploration of the entire search space
[25], linear programming [5], [26], [27], hidden Markov
models [28], and graph matching algorithms [4]. These
approaches can be divided into two major categories [29]. In
the first category, all possible β-topologies are enumerated
and a score is computed for each complete β-topology; the
β-topology with the highest score is then selected as the
best one [7], [25]. In the second category, a pseudo-energy
is assigned to each pair of β-strands, and the problem of
determining the best β-topology is reduced to maximizing the
strand-to-strand contact potentials of the protein [5], [4],
[26], [27], [28], [30].

BetaPro [4] was the first method to take the global nature
of β-sheet topologies into consideration; it predicts
β-topologies in three stages. Jones [31] takes advantage of
linear programming to predict the secondary structure of the
protein and β-sheet topologies. In [27], BetaPro was combined
with linear programming to predict β-sheet topologies. Also,
Rajgaria et al. [30] presented a method to determine the
tertiary structure of proteins, in which strand pairing
scores and contact maps are computed using linear
programming.

BetaZa [25] is a Bayesian approach that was introduced for
proteins with up to six β-strands. The conformational
features are modeled in a probabilistic framework that
combines prior knowledge about β-strand arrangements with
pairing potentials between the strands' amino acids. To
select the optimum β-sheet architecture, the search space is
reduced using some heuristics. Dynamic programming, with any
number of gaps allowed, is used to determine the optimum
pairwise alignment of the β-strands. Because it explores the
entire search space, BetaZa has a high time complexity.

BeST [5] and BCov [26] predict the β-sheet topology using
integer programming. BCov determines the β-sheet topology in
three steps: first, it computes the residue contact
propensity using PSICOV [18]; then, it computes the score of
each possible β-strand pairing; finally, an integer
programming optimization determines the β-sheet topology by
finding the best solution according to the constraints and
the pairing scores. In BCov, two β-strands are paired only
according to their alignment scores, and the stability of
conformations is not considered.

Ruczinski et al. [7] showed that the arrangement of β-strands
into β-sheets is not random: there is a distinct pattern for
β-strand arrangements. Some arrangements are unstable and are
therefore never seen in nature, whereas some particular
orientations are more favorable than others. In addition,
models for computing the probability of open β-topologies for
proteins were derived. The discriminative power of these
models is reduced significantly because the number of
possible β-strand organizations increases exponentially and
there is not sufficient training data to reliably represent
such conformations. Therefore, these models are limited to
proteins that contain at most ten β-strands. In this
research, we try to improve BCov by considering the stability
and frequency of β-strand pairing and β-sheet conformation.


3. Proposed Method 


In this article, two efforts are made to resolve the problem
of predicting β-sheet topology: BetaProbe1 and BetaProbe2.
Both can predict the β-sheet topology and the β-contact map.
As previously mentioned, in BCov [26] two β-strands are
paired based only on their alignment score, but Ruczinski et
al. [7] showed that the organization of β-strands into
β-sheets is not random and follows a distinct pattern.
Therefore, to improve BCov, we attempt to give greater
chances to more stable and more frequent conformations during
the selection. In this section, a general description of each
effort is presented first. Then, the steps of the proposed
methods are described in detail.


3-1. First Effort: BetaProbe1


BetaProbe1 consists of three major steps. (i) To achieve
more accurate alignments, a dynamic programming approach is
used to compute the β-strand pairwise alignment probability.
In addition, the pairwise interaction probability of each
pair of β-strands is computed according to [32]. Both the
pairwise alignment probability and the pairwise interaction
probability are then used to compute the score of each
interaction. (ii) An integer programming optimization is used
to maximize the total strand-to-strand contact potential of
the protein. In this step, the pairwise interaction scores
obtained in the previous step are used so that more stable
and more frequently observed paired β-strands are selected.
(iii) The best β-sheet topology is derived from the paired
strands determined in the previous step. To predict more
stable conformations, β-sheet topology probabilities are
considered. The pseudocode of BetaProbe1 is given in
Pseudocode 1.


Computing β-strand Pairwise Interaction Score: Many methods
have been proposed to find the best alignment between
sequences [33], [34]. Here we concentrate on an alignment
method proposed specifically for β-strands. In BetaProbe1,
the alignment probability of each pair of β-strands is
computed with the method proposed in BetaZa [25]. In this
method, the Needleman-Wunsch algorithm [33], [34] is used to
compute the optimum alignment between each pair of β-strands
in the parallel and anti-parallel directions. Then, the
probability of the optimum alignment is computed by dividing
the score of the best alignment by the sum of the scores of
all possible alignments. To improve the accuracy of the
alignments, amino acid pairing potentials computed
specifically for β-residues are used.
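The ratio "best alignment score over the sum of all alignment scores" can be sketched with two dynamic programming tables that share the same recurrence. This is only an illustration under stated assumptions: alignment scores are taken to be multiplicative, `pair_potential` and `gap_weight` are hypothetical stand-ins for the β-residue pairing potentials and gap handling of BetaZa [25], and the toy potential below is not taken from the paper.

```python
def alignment_probability(strand_a, strand_b, pair_potential,
                          gap_weight=0.05, antiparallel=False):
    """Weight of the best global alignment divided by the total weight
    of all global alignments (multiplicative scoring, assumed here)."""
    if antiparallel:
        strand_b = strand_b[::-1]  # anti-parallel: read the partner backwards
    n, m = len(strand_a), len(strand_b)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]   # max over alignments
    total = [[0.0] * (m + 1) for _ in range(n + 1)]  # sum over alignments
    best[0][0] = total[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best_here, total_here = 0.0, 0.0
            if i > 0 and j > 0:  # pair residue i of a with residue j of b
                w = pair_potential(strand_a[i - 1], strand_b[j - 1])
                best_here = max(best_here, best[i - 1][j - 1] * w)
                total_here += total[i - 1][j - 1] * w
            if i > 0:            # gap in strand_b
                best_here = max(best_here, best[i - 1][j] * gap_weight)
                total_here += total[i - 1][j] * gap_weight
            if j > 0:            # gap in strand_a
                best_here = max(best_here, best[i][j - 1] * gap_weight)
                total_here += total[i][j - 1] * gap_weight
            best[i][j], total[i][j] = best_here, total_here
    return best[n][m] / total[n][m]


def toy_potential(a, b):
    # Hypothetical pairing potential: identical residues pair more strongly.
    return 0.5 if a == b else 0.2


p_parallel = alignment_probability("VIVK", "TVIV", toy_potential)
p_antiparallel = alignment_probability("VIVK", "TVIV", toy_potential,
                                       antiparallel=True)
```

Replacing `max` by a sum in the same recurrence is what turns the optimum-alignment table into the normalizing constant, so the returned value always lies in (0, 1].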


Pseudocode 1: Probability-based algorithm for β-sheet topology
prediction (BetaProbe1)

  Input: the protein's strands
  Output: an open β-sheet conformation with the highest
          probability

  Step 1: Determining the β-strand Pairwise Interaction Score
    for each pair of strands s_i and s_j do
      compute their parallel and anti-parallel pairwise
        alignment probabilities
      compute their parallel and anti-parallel pairwise
        interaction probabilities
      score = alignment probability × interaction probability

  Step 2: Predicting the Closed β-Sheet Topology
    solve the integer programming problem

  Step 3: Determining the Best Open β-Sheet Topology
    for each closed β-topology do
      for each interaction between two β-strands do
        omit the interaction temporarily
        compute the probability of the new open β-sheet
    select the open β-sheet with the highest conformation
      probability


To store the pairwise alignment probability, a matrix
called "PAP (Pairwise Alignment Probability)" with n rows
and 2n columns is defined, where n is the number of β-strands
in the protein. Matrix PAP is defined as follows:


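$$
\mathrm{PAP}(i,j) =
\begin{cases}
S_{\text{parallel}}(s_i, s_j) & \text{if } i \le n,\ j \le n \text{ and } j \ne i \\
S_{\text{anti-parallel}}(s_i, s_{j-n}) & \text{if } i \le n,\ n+1 \le j \le 2n \text{ and } j \ne n+i \\
0 & \text{if } j = i \text{ or } j = n+i
\end{cases}
\tag{1}
$$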


In Equation (1), S_parallel(s_i, s_j) represents the
probability of the optimum alignment between strands s_i,
i=1,2,...,n, and s_j, j=1,2,...,n, when their interaction
type is parallel. Likewise, S_anti-parallel(s_i, s_j)
represents the probability of the optimum alignment between
strands s_i and s_j when their interaction type is
anti-parallel. The definition shows that the matrix PAP is
divided into two sections with an equal number of columns.
The left section stores the parallel alignment probabilities
and the right section stores the anti-parallel ones. The PAP
matrix for the protein in Fig. 1 is shown in Fig. 2-(a). It
is important to note that the alignment probability depends
on the spatial ordering of strands [25]. Therefore, for
non-bridge strands, the optimum alignment score of s_i with
s_j can differ from that of s_j with s_i. This is expressed
in (2) and (3):


$$ S_{\text{parallel}}(s_i, s_j) \ne S_{\text{parallel}}(s_j, s_i) \tag{2} $$


56 Eghdami et al.: β-sheet Topology Prediction Using Probability-based Integer Programming


$$ S_{\text{anti-parallel}}(s_i, s_j) \ne S_{\text{anti-parallel}}(s_j, s_i) \tag{3} $$

According to [32], some β-strand pairs are more stable and
are observed more frequently in nature than others. Based on
this observation, matrix "PIP (Pairwise Interaction
Probability)" is defined to store the pairwise interaction
probabilities of β-strands. The models derived by [32] were
used to compute these probabilities. Matrix PIP contains n
rows and 2n columns, as defined in (4):

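$$
\mathrm{PIP}(i,j) =
\begin{cases}
P_{\text{parallel}}(s_i, s_j) & \text{if } i \le n,\ j \le n \text{ and } j \ne i \\
P_{\text{anti-parallel}}(s_i, s_{j-n}) & \text{if } i \le n,\ n+1 \le j \le 2n \text{ and } j \ne i+n \\
0 & \text{if } j = i \text{ or } j = i+n
\end{cases}
\tag{4}
$$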
In (4), P_parallel(s_i, s_j) represents the probability that
strands s_i and s_j make a parallel interaction in the final
structure, based on protein characteristics such as the
helical status and the number of residues between the two
β-strands. Similarly, P_anti-parallel(s_i, s_j) is the
probability that strands s_i and s_j make an anti-parallel
interaction. The spatial ordering of strands has no effect on
the β-strand pairwise interaction probability. This is
expressed in (5) and (6):
$$ P_{\text{parallel}}(s_i, s_j) = P_{\text{parallel}}(s_j, s_i) \tag{5} $$

$$ P_{\text{anti-parallel}}(s_i, s_j) = P_{\text{anti-parallel}}(s_j, s_i) \tag{6} $$


In Fig. 2-(b), the matrix PIP is shown for the protein
1NZOD. Next, the scores of the interactions between each pair
of β-strands are determined. In the computation of each
score, both the pairwise interaction probability and the
pairwise alignment probability are considered. To store the
scores of the interactions, a two-dimensional n×2n matrix
called "Score" is introduced, where n is the number of
β-strands in the protein. The matrix is defined in (7).


$$ \mathrm{Score}(i,j) = \mathrm{PAP}(i,j) \times \mathrm{PIP}(i,j), \qquad 1 \le i \le n,\ 1 \le j \le 2n \tag{7} $$


In the Score matrix, the first n columns hold the scores of
parallel interactions and the last n columns hold the scores
of anti-parallel ones. It is important to note that the score
of an interaction between two strands depends on their
spatial ordering. The Score matrix is shown in Fig. 2-(c) for
the protein 1NZOD.
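The n×2n layout and the element-wise product in (7) can be illustrated with a small sketch. The numbers below are made-up toy values for a hypothetical protein with n = 2 strands, not values from the paper; the helper name `build_score` is likewise an assumption.

```python
def build_score(pap, pip):
    """Element-wise product of two n x 2n matrices, as in Equation (7)."""
    n = len(pap)
    return [[pap[i][j] * pip[i][j] for j in range(2 * n)] for i in range(n)]


# Columns 0..n-1 hold parallel entries, columns n..2n-1 anti-parallel ones.
# Entries with j == i or j == i + n are zero (a strand never pairs with itself).
pap = [[0.0, 0.2, 0.0, 0.9],   # alignment probabilities may be asymmetric
       [0.3, 0.0, 0.8, 0.0]]
pip = [[0.0, 0.1, 0.0, 0.9],   # interaction probabilities are symmetric
       [0.1, 0.0, 0.9, 0.0]]

score = build_score(pap, pip)
```

Because PAP is order-dependent while PIP is symmetric, the resulting Score matrix inherits the order dependence of PAP, which matches the remark that the score depends on the spatial ordering of the strands.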


Prediction of the Closed β-Sheet Topology: Unlike in BCov,
the pairwise interaction probabilities are considered in the
integer optimization problem in order to predict more stable
paired β-strands. In addition, the integer programming model
of BetaProbe1 is defined differently from that of BCov. The
closed β-sheet topology is obtained by solving the integer
problem in (8).


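$$
\begin{aligned}
\text{maximize:}\quad & \sum_{i=1}^{n}\sum_{j=1}^{2n} \mathrm{Score}(i,j)\,X(i,j) \\
\text{subject to:}\quad
& c1:\ X(i,j) \in \{0,1\} && \forall\, 1 \le i \le n,\ 1 \le j \le 2n \\
& c2:\ X(i,j) + X(j,i) + X(i,j+n) + X(j,i+n) \in \{0,1\} && \forall\, 1 \le i \le n,\ 1 \le j \le n \\
& c3:\ \sum_{j=1}^{2n} X(i,j) \in \{0,1\} && \forall\, 1 \le i \le n \\
& c4:\ \sum_{j=1}^{n} \bigl( X(j,i) + X(j,i+n) \bigr) \in \{0,1\} && \forall\, 1 \le i \le n \\
& c5:\ \sum_{j=1}^{2n} X(i,j) + \sum_{j=1}^{n} \bigl( X(j,i) + X(j,i+n) \bigr) \in \{1,2\} && \forall\, 1 \le i \le n \\
& c6:\ X(i,i) = X(i,i+n) = 0 && \forall\, 1 \le i \le n
\end{aligned}
\tag{8}
$$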

(a) PAP =
    0      0      0.138  0.019  0      1      0.047  0.007
    0      0      0.023  0      1      0      0.019  0.089
    0.15   0.025  0      0.238  0.05   0.015  0      0
    0.027  0      0.437  0      0.015  0.055  0      0

(b) PIP =
    0      0.01   0.27   0.27   0      0.99   0.73   0.73
    0.01   0      0.01   0.27   0.99   0      0.99   0.73
    0.27   0.01   0      0.27   0.73   0.99   0      0.73
    0.27   0.27   0.27   0      0.73   0.73   0.73   0

(c) Score =
    0      0      0.138  0.019  0      1      0.047  0.007
    0      0      0.023  0      1      0      0.019  0.089
    0.15   0.025  0      0.238  0.05   0.015  0      0
    0.027  0      0.437  0      0.015  0.055  0      0


Fig. 2. (a) The matrix PAP for a protein with PDB ID 1NZOD
computed by the dynamic programming algorithm [25]. (b) The
matrix PIP for protein 1NZOD computed by using the pairwise
interaction probabilities in [32]. (c) The matrix Score for protein
1NZOD computed by considering both pairwise alignment
probability and pairwise interaction probability in this paper.


X is an n×2n binary matrix (constraint c1) in which non-zero
entries indicate an interaction between the two corresponding
strands. Constraint c2 determines whether the interaction
between two strands is parallel or anti-parallel, so that a
pair of strands is linked by at most one interaction.
Constraints c3 and c4 ensure that every strand has at most
one partner on either side. Furthermore, each strand pairs
with at least one and at most two other β-strands
(constraint c5).
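For a protein as small as the four-strand example, the effect of constraints c1 to c6 can be checked by exhaustive search instead of an integer programming solver. The sketch below is such a brute-force stand-in, not the solver used in the paper: it assumes that c2 ranges over strand pairs, it uses 0-based indices, and the function name `solve_closed_topology` is hypothetical. The score values are copied from Fig. 2-(c) as printed.

```python
from itertools import product


def solve_closed_topology(score):
    """Maximize sum(Score * X) under c1-c6 by enumerating row choices."""
    n = len(score)
    # c1, c3, c6: each row selects at most one column, never its own strand.
    options = [[None] + [j for j in range(2 * n) if j % n != i]
               for i in range(n)]
    best_value, best_choice = float("-inf"), None
    for choice in product(*options):
        partners = [None if j is None else j % n for j in choice]
        in_degree = [partners.count(k) for k in range(n)]
        # c4: a strand is chosen as a column partner at most once.
        if any(d > 1 for d in in_degree):
            continue
        # c2: two strands are linked by at most one interaction.
        if any(p is not None and partners[p] == i
               for i, p in enumerate(partners)):
            continue
        # c5: every strand has one or two partners in total.
        if any(not 1 <= (partners[i] is not None) + in_degree[i] <= 2
               for i in range(n)):
            continue
        value = sum(score[i][j] for i, j in enumerate(choice) if j is not None)
        if value > best_value:
            best_value, best_choice = value, choice
    if best_choice is None:
        return None
    return [[1 if best_choice[i] == j else 0 for j in range(2 * n)]
            for i in range(n)]


SCORE_FIG_2C = [
    [0, 0, 0.138, 0.019, 0, 1, 0.047, 0.007],
    [0, 0, 0.023, 0, 1, 0, 0.019, 0.089],
    [0.15, 0.025, 0, 0.238, 0.05, 0.015, 0, 0],
    [0.027, 0, 0.437, 0, 0.015, 0.055, 0, 0],
]

X = solve_closed_topology(SCORE_FIG_2C)
```

With the scores printed in Fig. 2-(c), this search selects the entries (1,6), (2,8), (3,1) and (4,3) in the paper's 1-based numbering. The enumeration grows as (2n-1)^n, so it is only practical for a handful of strands; larger instances need a proper integer programming solver.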

In Fig. 3, the matrix X and the predicted closed β-sheet
topology for protein 1NZOD are shown.


(a) X =
    0  0  0  0  0  1  0  0
    0  0  0  0  0  0  0  1
    1  0  0  0  0  0  0  0
    0  0  1  0  0  0  0  0

(b) [Image: predicted closed β-sheet topology]


Fig. 3. (a) The matrix X obtained by solving the integer program.
(b) The predicted closed β-sheet topology for the protein in Fig. 1.
Determining the Best Open β-Sheet Topology: In the previous
step, the paired β-strands and their interaction types are
determined by the solution of the integer program. The
predicted interactions form closed β-sheets; in other words,
each strand has two partners. To extend the proposed method
to open β-sheets, the β-sheet topology probabilities
determined by [32] are used. In this step, the probability of
each possible open sheet is computed, and the most probable
one is selected as the best β-sheet topology. To enumerate
all possible open β-sheets, the interactions of the closed
sheet are omitted one at a time. The process of determining
the best open β-sheet topology is illustrated in Fig. 4. In
addition, Fig. 5 shows all possible open β-sheets for the
closed one in Fig. 3-(b).
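The "omit one interaction at a time" enumeration can be sketched as follows. The interaction list encodes the closed sheet of Fig. 3 as (row strand, partner strand, type) triples. The scoring function `toy_probability` and its weights are purely hypothetical placeholders; in the paper this role is played by the conformation probabilities of [32], which are not reproduced here.

```python
def best_open_sheet(closed_interactions, conformation_probability):
    """Drop each interaction in turn and keep the most probable open sheet."""
    candidates = []
    for k in range(len(closed_interactions)):
        open_sheet = closed_interactions[:k] + closed_interactions[k + 1:]
        candidates.append((conformation_probability(open_sheet), open_sheet))
    return max(candidates, key=lambda c: c[0])[1]


# Hypothetical stand-in for the conformation model: favors anti-parallel
# pairings and strands that are close in sequence. Not taken from the paper.
TOY_WEIGHT = {"anti": 0.9, "par": 0.4}


def toy_probability(sheet):
    p = 1.0
    for a, b, kind in sheet:
        p *= TOY_WEIGHT[kind] / abs(a - b)
    return p


closed = [(1, 2, "anti"), (2, 4, "anti"), (3, 1, "par"), (4, 3, "par")]
open_sheet = best_open_sheet(closed, toy_probability)
```

A closed sheet with m interactions yields exactly m open candidates, so this step is linear in the number of predicted pairings.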


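[Fig. 4 flowchart: get the closed sheet conformations; for each
closed sheet, omit each interaction in turn and take the
configuration probability of the resulting open sheet as
score_new; if score_new exceeds score_best, set
score_best = score_new and topology_best = new sheet topology;
once all closed sheets have been processed, report
topology_best.]


Fig. 4. The process of determining the most probable open β-sheet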


3-2. Second Effort: BetaProbe2 


BetaProbe2 consists of two major steps. (i) As in
BetaProbe1, the score of each interaction is computed from
both the pairwise alignment probability and the pairwise
interaction probability. To obtain more accurate alignments,
a dynamic programming approach is used to compute the
alignment probability of each pair of β-strands. (ii) An
integer programming optimization is introduced to determine
the β-sheet topology. Unlike in BetaProbe1, the integer
problem is defined to maximize the product of the interaction
scores. In this step, both pairwise interaction probabilities
and pairwise alignment probabilities are used to give a
greater chance of selection to more stable and more
frequently observed paired β-strands. Unlike in BetaProbe1,
the β-sheet topology obtained from the integer program
solution is not closed. The pseudocode of BetaProbe2 is given
in Pseudocode 2.


[Fig. 5 image; panel titles: Closed β-sheet topology, Open β-sheet topology]


Fig. 5. All possible open β-sheets from a closed one. The gray cell
shows the best open β-sheet topology for protein 1NZOD.
Pseudocode 2: Probability-based algorithm for β-sheet topology
prediction (BetaProbe2)

  Input: the protein's strands
  Output: an open β-sheet conformation with the highest
          probability

  Step 1: Determining the β-strand Pairwise Interaction Score
    for each pair of strands s_i and s_j do
      compute their parallel and anti-parallel pairwise
        alignment probabilities
      compute their parallel and anti-parallel pairwise
        interaction probabilities
      score = alignment probability × interaction probability

  Step 2: Prediction of the β-Sheet Topology
    for each pair of strands s_i and s_j do
      score' = log(score)
    solve the integer programming problem


Computing β-strand Pairwise Interaction Score: The elements
of matrices PIP and PAP are computed first, as in
BetaProbe1. Then, the score of the interaction between each
pair of β-strands is determined.


Determining the β-sheet Topology: The problem of specifying
the best β-sheet topology is reduced to an integer
optimization. Assuming that the occurrence of an interaction
between two strands is independent of the other β-strand
interactions, the probability that several interactions occur
together is the product of their probabilities. Therefore, an
integer optimization is used to maximize the product of the
β-strand pairwise interaction scores, because each pairwise
interaction score represents the probability that an
interaction between two strands occurs, according to the
pairwise alignment probability and the pairwise interaction
probability. Since pairwise interaction scores have positive
values and the logarithm is a monotonically increasing
function, it is possible to maximize the sum of the
logarithms of the pairwise interaction scores instead of
maximizing their product. The problem of determining the
β-sheet topology then becomes the integer linear problem
given in (9). Note that the constraints of the problem are
the same as in (8). In Fig. 6, the matrix X and the final
β-sheet topology for protein 1NZOD are presented.


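$$
\begin{aligned}
\text{maximize:}\quad & \sum_{i=1}^{n}\sum_{j=1}^{2n} \log\bigl(\mathrm{Score}(i,j)\bigr)\,X(i,j) \\
\text{subject to:}\quad
& c1:\ X(i,j) \in \{0,1\} && \forall\, 1 \le i \le n,\ 1 \le j \le 2n \\
& c2:\ X(i,j) + X(j,i) + X(i,j+n) + X(j,i+n) \in \{0,1\} && \forall\, 1 \le i \le n,\ 1 \le j \le n \\
& c3:\ \sum_{j=1}^{2n} X(i,j) \in \{0,1\} && \forall\, 1 \le i \le n \\
& c4:\ \sum_{j=1}^{n} \bigl( X(j,i) + X(j,i+n) \bigr) \in \{0,1\} && \forall\, 1 \le i \le n \\
& c5:\ \sum_{j=1}^{2n} X(i,j) + \sum_{j=1}^{n} \bigl( X(j,i) + X(j,i+n) \bigr) \in \{1,2\} && \forall\, 1 \le i \le n \\
& c6:\ X(i,i) = X(i,i+n) = 0 && \forall\, 1 \le i \le n
\end{aligned}
\tag{9}
$$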
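The step from the product objective to the linear objective in (9) can be spelled out as a short derivation. It assumes, as the text does, that every selectable score is strictly positive; how entries with Score(i,j) = 0 are handled (for example by excluding them or assigning a large negative value) is not stated in the source and is left open here.

```latex
\begin{aligned}
\arg\max_{X}\ \prod_{i=1}^{n}\prod_{j=1}^{2n} \mathrm{Score}(i,j)^{X(i,j)}
 &= \arg\max_{X}\ \log \prod_{i=1}^{n}\prod_{j=1}^{2n} \mathrm{Score}(i,j)^{X(i,j)} \\
 &= \arg\max_{X}\ \sum_{i=1}^{n}\sum_{j=1}^{2n} X(i,j)\,\log \mathrm{Score}(i,j)
\end{aligned}
```

Because X(i,j) is binary, the factor Score(i,j)^{X(i,j)} equals Score(i,j) for a selected interaction and 1 otherwise. Since every score is at most 1, each term X(i,j) log Score(i,j) is at most 0, so selecting an additional interaction can never increase the objective; interactions are added only as far as the constraints require.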


(a) X =
    0  0  0  0  0  1  0  0
    0  0  0  0  0  0  0  0
    0  0  0  0  0  0  0  0
    0  0  1  0  0  0  0  0

(b) [Image: final predicted β-sheet topology]


Fig. 6. (a) The matrix X obtained by solving the integer
programming problem. (b) The final predicted β-sheet topology.

4. Results 


In this section, the evaluation metrics and the data set
are described first. Then, the results of evaluating
BetaProbe1 and BetaProbe2 are presented.


Evaluation metrics: To evaluate the performance of the 
proposed methods, well-known metrics in (10), (11) and 
(12) are used. These metrics have been used to evaluate 
state-of-the-art methods [5], [26], [25]: 


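$$ \text{Precision} = \frac{TP}{TP + FP} \times 100 \tag{10} $$

$$ \text{Recall} = \frac{TP}{TP + FN} \times 100 \tag{11} $$

$$ \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{12} $$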


Note that TP, FP, and FN represent the numbers of true
positives, false positives, and false negatives, respectively.
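Equations (10) to (12) translate directly into a few lines of code. The sketch below is a plain transcription with a hypothetical function name; the counts in the usage example are made-up numbers, not results from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall (both in percent) and F1-score from raw counts."""
    precision = 100.0 * tp / (tp + fp)
    recall = 100.0 * tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy counts: 6 correctly predicted pairings, 2 spurious ones, 4 missed ones.
precision, recall, f1 = precision_recall_f1(tp=6, fp=2, fn=4)
```

The function assumes at least one predicted and one true positive; with tp = 0 the denominators would need separate handling.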


Dataset: We used the BetaSheet916 set for the evaluation.
This dataset was extracted from the PDB by [4] and includes
916 proteins. To perform cross-validation, it is split
randomly and evenly into 10 folds. The DSSP program [35] is
used for assigning the secondary structure. In this article,
β-residues include (1) the extended β-strands (shown by E in
the DSSP output) and (2) the isolated β-bridges (shown by B
in the DSSP output).


Cross validation: At each step of the cross-validation, one
fold is used as the test data and the remaining folds form
the training set. Models are trained on the training set, and
predictions are made on the test set. This process is
repeated until every protein in the original set has been
used for testing. The accuracy measures are computed after
all predictions have been made.
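A random, even 10-fold split of 916 items and the resulting train/test rotation can be sketched as follows. The function names, the fixed seed, and the use of integer indices in place of protein identifiers are assumptions made for illustration; the paper does not describe its splitting code.

```python
import random


def make_folds(items, k=10, seed=0):
    """Shuffle the items and deal them into k folds of near-equal size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def cross_validation_splits(items, k=10, seed=0):
    """Yield (training set, test fold) pairs, one per fold."""
    folds = make_folds(items, k, seed)
    for test_index in range(k):
        train = [x for i, fold in enumerate(folds) if i != test_index
                 for x in fold]
        yield train, folds[test_index]


protein_ids = range(916)  # placeholder indices for the BetaSheet916 proteins
splits = list(cross_validation_splits(protein_ids))
```

With 916 proteins, six folds receive 92 proteins and four receive 91, and each protein appears in exactly one test fold.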

We carried out three simulations. In the first simulation,
a 10-fold cross-validation experiment was performed on
BetaSheet916 for proteins with at most four β-strands, each
with fewer than three partners. Similarly, in the second
simulation, the proposed method was evaluated on proteins
with at most five β-strands, each with fewer than three
partners. The third simulation was performed on proteins with
at most six β-strands, each with fewer than three partners.

BetaProbe1 and BetaProbe2 were compared with the
state-of-the-art method, BCov, which is also based on
integer programming. For this purpose, in the first step of
BCov, the residue pairing probabilities calculated in
BetaPro were used. Then the methods were evaluated on
the same data set.

In Table 1, the performance of BetaProbe1 at the strand
level is compared with the performance of BCov. The
recall, precision, and F1-score measures are shown in this
table.

The Wilcoxon test for related samples was used to determine
whether there is a significant difference in the precision,
recall, and F1-score of the two methods. To perform the test,
the data set was divided into ten subsets, as described in
the Dataset section, and the results of BetaProbe1 and BCov
were evaluated on these subsets. At the 5% significance
level, the test showed a significant difference between the
recall of the two methods at the pairing direction level for
proteins with up to six and up to five strands. This means
that the recall improvement of BetaProbe1 over BCov is
statistically significant. From Table 1, it can be concluded
that, besides using β-sheet conformation probabilities,
considering pairwise interaction probabilities in the
computation of the β-strand interaction score and combining
it with integer programming improves the accuracy of the
pairing directions. In Chart 1, Chart 2, and Chart 3, the
recall, precision, and F1-score of BetaProbe1 at the pairing
direction level are shown and compared with those of BCov.


Table 1. The performance of BetaProbe1 at strand level on
proteins with 6 or fewer β-strands on BetaSheet916.


Evaluation level    Method           Recall | Precision | F1-score
Strand pairing      BCov ≤6 (a)        79   |    84     |    82
                    BetaProbe1 ≤6      73   |    69     |    71
                    BCov ≤5 (b)        81   |    86     |    83
                    BetaProbe1 ≤5      76   |    73     |    74
                    BCov ≤4 (c)        82   |    85     |    83
                    BetaProbe1 ≤4      75   |    72     |    74
Pairing direction   BCov ≤6            64   |    68     |    66
                    BetaProbe1 ≤6      70   |    67     |    68
                    BCov ≤5            64   |    69     |    66
                    BetaProbe1 ≤5      73   |    70     |    72
                    BCov ≤4            72   |    75     |    73
                    BetaProbe1 ≤4      74   |    70     |    72


a) The evaluation is done on proteins with up to 6 β-strands
b) The evaluation is done on proteins with up to 5 β-strands
c) The evaluation is done on proteins with up to 4 β-strands
In Table 2, the performance of BetaProbe2 is compared with
that of BCov at strand level. For the three subsets of
proteins, the precision and the F1-score of the proposed
method at pairing direction level are better than those of
BCov. Further, the precision of BetaProbe2 is better than
that of BCov at strand pairing level. At the 5% significance
level, the Wilcoxon test for related samples showed a
significant difference between the precision of BCov and
BetaProbe2 at the pairing direction level for all subsets of
proteins. It can be concluded that the precision improvement
of BetaProbe2 over BCov is statistically significant.
Chart 1: Recall comparison of BetaProbe1 to BCov


[Chart image: bar chart of Precision (%) versus Number of Strands
(up to 6, up to 5, up to 4); legend: BCov, BetaProbe1]


Chart 2: Precision comparison of BetaProbe1 to BCov


Chart 3: F1-score comparison of BetaProbe1 to BCov

The proposed methods improve on BCov because adding pairwise
interaction probabilities to the integer programming in the
second step makes β-strand interactions that are more
frequent in nature more likely to be selected. In addition,
to improve the pairwise alignments between β-strands, a
dynamic programming approach in which gaps are allowed is
used. Furthermore, using the amino acid pairing potentials
provided by BetaZa in the first step has improved the
accuracy.


Table 2. The performance of BetaProbe2 at strand level on
proteins with 6 or fewer β-strands on BetaSheet916.


Evaluation level    Method           Recall | Precision | F1-score
Strand pairing      BCov ≤6            79   |    84     |    82
                    BetaProbe2 ≤6      65   |    85     |    73
                    BCov ≤5            81   |    86     |    83
                    BetaProbe2 ≤5      68   |    87     |    76
                    BCov ≤4            82   |    85     |    83
                    BetaProbe2 ≤4      71   |    90     |    79
Pairing direction   BCov ≤6            64   |    68     |    66
                    BetaProbe2 ≤6      63   |    83     |    72
                    BCov ≤5            64   |    69     |    66
                    BetaProbe2 ≤5      67   |    87     |    75
                    BCov ≤4            72   |    75     |    73
                    BetaProbe2 ≤4      71   |    89     |    79


In Table 3 and Table 4, the results of BetaProbe2 are 
compared to BetaZa at the residue level and strand level, 
respectively. The same alignment technique is used in both 
methods. BetaZa searches the entire search space to find the 
best β-sheet topology. Although the execution time of 
BetaProbe2 is less than that of BetaZa, the precision of 
BetaProbe2 is better at the pairing direction level. In 
addition, the precision of BetaProbe2 at the strand pairing 
level and the contact map level is comparable with that of 
BetaZa. The recall, precision, and F1-score of BetaProbe2 at 
the pairing direction level are compared with those of the 
other methods in Chart 4, Chart 5, and Chart 6, 
respectively. 

In Table 5 and Table 6, the performance of BetaProbe1 
and BetaProbe2 is presented at the residue level and 
strand level, respectively. As mentioned before, in 
BetaProbe1 the sum of the interaction scores is maximized 
in the integer programming step, while in BetaProbe2 the 
product of the interaction scores is maximized. Because the 
scores of the interactions lie between zero and one, every 
additional interaction lowers the product, so BetaProbe2 
predicts fewer interactions between β-strands. 
Consequently, the predicted pairwise interactions are the 
most frequent ones. Therefore, the precision of the predicted 
interactions increases while the recall decreases. 
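
The effect of the two objectives can be seen in a small brute-force sketch. It is not the integer program of the paper: it simply enumerates all subsets of candidate strand pairs for a hypothetical four-strand protein. The scores, the function name `best_topology`, and the rule that every strand has at least one and at most two partners are assumptions made for illustration; the lower bound is there only so that the product objective cannot trivially select nothing.

```python
from itertools import combinations
from math import prod


def best_topology(scores, objective, min_partners=1, max_partners=2):
    """Exhaustively pick the set of strand pairs that maximises the objective."""
    strands = sorted({s for pair in scores for s in pair})
    pairs = sorted(scores)
    best, best_value = None, float("-inf")
    for k in range(1, len(pairs) + 1):
        for subset in combinations(pairs, k):
            # Count how many partners each strand has in this candidate set.
            degree = {s: 0 for s in strands}
            for a, b in subset:
                degree[a] += 1
                degree[b] += 1
            if not all(min_partners <= d <= max_partners for d in degree.values()):
                continue
            value = objective(scores[p] for p in subset)
            if value > best_value:
                best, best_value = set(subset), value
    return best


# Hypothetical interaction scores in (0, 1) for a four-strand protein.
scores = {
    (1, 2): 0.9, (2, 3): 0.8, (3, 4): 0.9,
    (1, 4): 0.3, (1, 3): 0.1, (2, 4): 0.1,
}
by_sum = best_topology(scores, sum)       # BetaProbe1-style objective
by_product = best_topology(scores, prod)  # BetaProbe2-style objective
print(sorted(by_sum))
print(sorted(by_product))
```

With these toy scores, the sum objective keeps adding pairs until the partner limit is reached, including the weak pair (1, 4), while the product objective retains only the two strongest pairs. Fewer but more reliable predictions correspond to higher precision and lower recall.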


Table 3. The performance of BetaProbe2 at residue level on 
proteins with 6 or fewer β-strands on BetaSheet916. 


Evaluation level  | Method          | Recall | Precision | F1-score 
Contact map       | BetaZa (≤6)     | 78     | 80        | 79 
                  | BetaProbe2 (≤6) | 58     | 80        | 67 
                  | BetaZa (≤5)     | 80     | 80        | 80 
                  | BetaProbe2 (≤5) | 60     | 79        | 68 
                  | BetaZa (≤4)     | 82     | 82        | 82 
                  | BetaProbe2 (≤4) | 61     | 80        | 69 


Table 4. The performance of BetaProbe2 at strand level on 
proteins with 6 or fewer β-strands on BetaSheet916. 


Evaluation level  | Method          | Recall | Precision | F1-score 
Strand pairing    | BetaZa (≤6)     | 83     | 84        | 84 
                  | BetaProbe2 (≤6) | 65     | 85        | 73 
                  | BetaZa (≤5)     | 87     | 88        | 87 
                  | BetaProbe2 (≤5) | 68     | 87        | 76 
                  | BetaZa (≤4)     | 91     | 91        | 91 
                  | BetaProbe2 (≤4) | 71     | 90        | 79 
Pairing direction | BetaZa (≤6)     | 80     | 81        | 81 
                  | BetaProbe2 (≤6) | 63     | 83        | 72 
                  | BetaZa (≤5)     | 84     | 85        | 84 
                  | BetaProbe2 (≤5) | 67     | 87        | 75 
                  | BetaZa (≤4)     | 88     | 88        | 88 
                  | BetaProbe2 (≤4) | 71     | 89        | 79 

Chart 4: Recall comparison of BetaProbe2 to other methods 
Chart 5: Precision comparison of BetaProbe2 to other methods 
Chart 6: F1-score comparison of BetaProbe2 to other methods 


5. Conclusion and Future Work 


Determining the topology of β-sheets is a challenging 
problem. In this paper, BetaProbe1 and BetaProbe2, two 
probability-based methods for β-sheet topology prediction, 
are introduced. In these methods, first, the optimum 
pairwise alignment probabilities of β-strands are 
determined using a dynamic programming approach in which 
any number of gaps is allowed. Then, the probability of the 
occurrence of an interaction is computed. After that, the 
score of an interaction is computed using both the pairwise 
alignment probability and the pairwise interaction 
probability. Finally, the problem of finding the β-sheet 
topology is reduced to an integer optimization problem. 
10-fold cross-validation experiments are performed to 
evaluate the proposed methods. The results show that these 
methods outperform the most recent integer 
programming-based method [26]. The major novelties of this 
research can be summarized as follows: 

1. Considering both pairwise alignment probability and 
pairwise interaction probability to compute the score of 
an interaction between two β-strands; 

2. Combining the probability of occurrence of an 
interaction with the integer programming; 

3. Considering the β-sheet conformation probability in 
nature to predict more frequent β-sheet topologies; 

4. Considering the spatial ordering of β-strands in β-sheets 
in the integer programming; 

5. The ability of the proposed methods to predict the 
β-sheet structure for proteins with multiple β-sheets; 

6. The ability of the proposed methods to predict the 
β-sheet topology for proteins with closed β-sheets. 




The performance of the predictions can be improved even 
further. Combining the residue pairing propensities with 
those of PSICOV [18] may make the methods more accurate. 
Our methods currently handle proteins with six or fewer 
β-strands in which each strand has fewer than three 
partners. This can be extended to proteins with a higher 
number of β-strands and higher-order partners by extending 
the probabilities and adding new constraints to the integer 
programming step. 


Table 5. The performance of BetaProbe1 and BetaProbe2 at 
residue level on proteins with 6 or fewer β-strands on 
BetaSheet916. 


Evaluation level  | Method          | Recall | Precision | F1-score 
Contact map       | BetaProbe1 (≤6) | 63     | 66        | 64 
                  | BetaProbe2 (≤6) | 58     | 80        | 67 
                  | BetaProbe1 (≤5) | 64     | 66        | 65 
                  | BetaProbe2 (≤5) | 60     | 79        | 68 
                  | BetaProbe1 (≤4) | 64     | 67        | 65 
                  | BetaProbe2 (≤4) | 61     | 80        | 69 


Table 6. The performance of BetaProbe1 and BetaProbe2 at 
strand level on proteins with 6 or fewer β-strands on 
BetaSheet916. 


Evaluation level  | Method          | Recall | Precision | F1-score 
Strand pairing    | BetaProbe1 (≤6) | 73     | 69        | 71 
                  | BetaProbe2 (≤6) | 65     | 85        | 73 
                  | BetaProbe1 (≤5) | 76     | 73        | 74 
                  | BetaProbe2 (≤5) | 68     | 87        | 76 
                  | BetaProbe1 (≤4) | 75     | 72        | 74 
                  | BetaProbe2 (≤4) | 71     | 90        | 79 
Pairing direction | BetaProbe1 (≤6) | 70     | 67        | 68 
                  | BetaProbe2 (≤6) | 63     | 83        | 72 
                  | BetaProbe1 (≤5) | 73     | 70        | 72 
                  | BetaProbe2 (≤5) | 67     | 87        | 75 
                  | BetaProbe1 (≤4) | 74     | 70        | 72 
                  | BetaProbe2 (≤4) | 71     | 89        | 79 